Video Stream Retrieval of Unseen Queries using Semantic Memory
Retrieval of live, user-broadcast video streams is an under-addressed and
increasingly relevant challenge. The on-line nature of the problem requires
temporal evaluation and the unforeseeable scope of potential queries motivates
an approach which can accommodate arbitrary search queries. To account for the
breadth of possible queries, we adopt a no-example approach to query retrieval,
which uses a query's semantic relatedness to pre-trained concept classifiers.
To adapt to shifting video content, we propose memory pooling and memory
welling methods that favor recent information over long past content. We
identify two stream retrieval tasks, instantaneous retrieval at any particular
time and continuous retrieval over a prolonged duration, and propose means for
evaluating them. Three large scale video datasets are adapted to the challenge
of stream retrieval. We report results for our search methods on the new stream
retrieval tasks, as well as demonstrate their efficacy in a traditional,
non-streaming video task.
Comment: Presented at BMVC 2016, British Machine Vision Conference, 2016
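The memory pooling idea above, favoring recent stream content over long-past frames, can be sketched as an exponentially decaying average of per-frame concept scores. This is a minimal illustration under assumed names and weighting; the function `memory_pool` and the `decay` parameter are hypothetical stand-ins, not the paper's exact formulation:

```python
def memory_pool(frame_scores, decay=0.9):
    """Pool per-frame concept scores with exponentially decaying weights,
    so the newest frames dominate and long-past content fades out.

    frame_scores: list of per-frame score vectors, oldest first.
    """
    n = len(frame_scores)
    pooled = [0.0] * len(frame_scores[0])
    weight_sum = 0.0
    for t, scores in enumerate(frame_scores):
        w = decay ** (n - 1 - t)  # newest frame gets weight 1.0
        weight_sum += w
        for i, s in enumerate(scores):
            pooled[i] += w * s
    return [p / weight_sum for p in pooled]

# A concept firing only in the most recent frame dominates the pool:
recent = memory_pool([[0.0], [1.0]], decay=0.5)  # -> [0.666...]
old = memory_pool([[1.0], [0.0]], decay=0.5)     # -> [0.333...]
```

Matching a query then reduces to scoring its semantic relatedness against this pooled concept vector rather than against the full stream history.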
Infinite Class Mixup
Mixup is a widely adopted strategy for training deep networks, where
additional samples are augmented by interpolating inputs and labels of training
pairs. Mixup has been shown to improve classification performance, network
calibration, and out-of-distribution generalisation. While effective, a
cornerstone of Mixup, namely that networks learn linear behaviour patterns
between classes, is only indirectly enforced since the output interpolation is
performed at the probability level. This paper seeks to address this limitation
by mixing the classifiers directly instead of mixing the labels for each mixed
pair. We propose to define the target of each augmented sample as a uniquely
new classifier, whose parameters are a linear interpolation of the classifier
vectors of the input pair. The space of all possible classifiers is continuous
and spans all interpolations between classifier pairs. To make optimisation
tractable, we propose a dual-contrastive Infinite Class Mixup loss, where we
contrast the classifier of a mixed pair to both the classifiers and the
predicted outputs of other mixed pairs in a batch. Infinite Class Mixup is
generic in nature and applies to many variants of Mixup. Empirically, we show
that it outperforms standard Mixup and variants such as RegMixup and Remix on
balanced, long-tailed, and data-constrained benchmarks, highlighting its broad
applicability.
Comment: BMVC 202
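The core mechanism, interpolating classifier vectors rather than labels, can be sketched as follows. Function names are hypothetical, classifiers are represented as plain weight vectors, and the dual-contrastive loss itself is omitted:

```python
def mix_classifiers(w_a, w_b, lam):
    """Target for a mixed sample: a uniquely new classifier whose weight
    vector linearly interpolates the two input classes' classifiers."""
    return [lam * a + (1 - lam) * b for a, b in zip(w_a, w_b)]

def mixup_pair(x_a, x_b, w_a, w_b, lam):
    """Interpolate the inputs as in standard Mixup, but pair the mixed
    input with a mixed *classifier* rather than a mixed label."""
    x_mix = [lam * p + (1 - lam) * q for p, q in zip(x_a, x_b)]
    return x_mix, mix_classifiers(w_a, w_b, lam)

x_mix, w_mix = mixup_pair([1.0, 0.0], [0.0, 1.0],
                          [2.0, 0.0], [0.0, 2.0], lam=0.5)
# x_mix == [0.5, 0.5]; w_mix == [1.0, 1.0]
```

Because every value of `lam` yields a distinct target classifier, the space of targets is continuous, which is what the "infinite class" framing refers to.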
Objects2action: Classifying and localizing actions without any video example
The goal of this paper is to recognize actions in video without the need for
examples. Different from traditional zero-shot approaches we do not demand the
design and specification of attribute classifiers and class-to-attribute
mappings to allow for transfer from seen classes to unseen classes. Our key
contribution is objects2action, a semantic word embedding that is spanned by a
skip-gram model of thousands of object categories. Action labels are assigned
to an object encoding of unseen video based on a convex combination of action
and object affinities. Our semantic embedding has three main characteristics to
accommodate the specifics of actions. First, we propose a mechanism to
exploit multiple-word descriptions of actions and objects. Second, we
incorporate the automated selection of the most responsive objects per action.
And finally, we demonstrate how to extend our zero-shot approach to the
spatio-temporal localization of actions in video. Experiments on four action
datasets demonstrate the potential of our approach.
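The scoring step, a convex combination of action and object affinities with automated selection of the most responsive objects, can be sketched like this. The cosine affinity, the toy 2-d vectors, and all names are illustrative assumptions, not the paper's trained skip-gram embedding:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def action_score(action_vec, object_vecs, object_probs, top_k=2):
    """Zero-shot score of an unseen action for one video: a convex
    combination of the video's object responses, weighted by action-object
    affinity in the embedding, keeping only the top_k most responsive
    objects for this action."""
    affinities = sorted(((cosine(action_vec, vec), name)
                         for name, vec in object_vecs.items()), reverse=True)
    top = affinities[:top_k]
    z = sum(a for a, _ in top) or 1.0
    return sum((a / z) * object_probs.get(name, 0.0) for a, name in top)

# Toy 2-d "embedding": the action aligns with "ball", not "water".
score = action_score([1.0, 0.0],
                     {"ball": [1.0, 0.0], "water": [0.0, 1.0]},
                     {"ball": 0.8, "water": 0.1}, top_k=1)  # -> 0.8
```

No action-labeled video is needed: only the object classifier responses and the shared word embedding enter the score.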
Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks
How can we reuse existing knowledge, in the form of available datasets, when
solving a new and apparently unrelated target task from a set of unlabeled
data? In this work we make a first contribution to answer this question in the
context of image classification. We frame this quest as an active learning
problem and use zero-shot classifiers to guide the learning process by linking
the new task to the existing classifiers. By revisiting the dual formulation of
adaptive SVM, we reveal two basic conditions to choose greedily only the most
relevant samples to be annotated. On this basis we propose an effective active
learning algorithm which learns the best possible target classification model
with minimum human labeling effort. Extensive experiments on two challenging
datasets show the value of our approach compared to the state-of-the-art active
learning methodologies, as well as its potential to reuse past datasets with
minimal effort for future tasks.
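The greedy selection step can be illustrated with a simple least-margin rule: pick the unlabeled samples whose zero-shot prior score is least confident. This is a deliberately simplified stand-in for the paper's conditions derived from the dual formulation of adaptive SVM, not a reproduction of them:

```python
def select_queries(prior_scores, budget):
    """Greedily pick the `budget` unlabeled samples whose zero-shot prior
    score has the smallest absolute margin (least confident), as a
    simplified proxy for the dual-SVM selection conditions."""
    ranked = sorted(range(len(prior_scores)),
                    key=lambda i: abs(prior_scores[i]))
    return ranked[:budget]

# Samples 3 and 1 sit closest to the zero-shot decision boundary:
picked = select_queries([0.9, -0.05, 0.5, 0.02], budget=2)  # -> [3, 1]
```

The selected indices would then be sent for human annotation, and the target classifier retrained on the growing labeled set.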
Learning to Rank and Quadratic Assignment
In this paper we show that the optimization of several ranking-based
performance measures, such as precision-at-k and average precision, is
intimately related to the solution of quadratic assignment problems. Both the
task of test-time prediction of the best ranking and the task of constraint
generation in estimators based on structured support vector machines can be
seen as special cases of quadratic assignment problems. Although such problems
are in general NP-hard, we identify a polynomially-solvable subclass (for both
inference and learning) that still enables the modeling of a substantial
number of pairwise rank interactions. We show preliminary results on a public
benchmark image annotation dataset, which indicate that this model can deliver
higher performance than ranking models without pairwise rank dependencies.
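For reference, the two ranking measures named in the abstract have standard definitions that are straightforward to compute directly; the quadratic-assignment view concerns optimizing them, not evaluating them. A minimal sketch:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

rank = ["a", "b", "c"]
rel = {"a", "c"}
p2 = precision_at_k(rank, rel, 2)  # -> 0.5
ap = average_precision(rank, rel)  # -> (1/1 + 2/3) / 2 = 0.833...
```

Both measures depend on the full permutation of items, which is why predicting the best ranking couples pairwise position decisions and leads to assignment-type problems.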
3D Neighborhood Convolution: Learning Depth-Aware Features for RGB-D and RGB Semantic Segmentation
A key challenge for RGB-D segmentation is how to effectively incorporate 3D
geometric information from the depth channel into 2D appearance features. We
propose to model the effective receptive field of 2D convolution based on the
scale and locality of the 3D neighborhood. Standard convolutions are local in
the image space, often with a fixed receptive field of 3x3 pixels. We propose
to define convolutions that are local with respect to the corresponding point
in 3D real-world space, where the depth channel adapts the receptive field of
the convolution; the resulting filters are invariant to scale and focus on a
certain range of depth. We introduce the 3D Neighborhood Convolution
(3DN-Conv), a convolutional operator over 3D neighborhoods. Furthermore,
estimated depth allows our RGB-D semantic segmentation model to be applied to
plain RGB input. Experimental results validate that our proposed 3DN-Conv
operator improves semantic segmentation, using either ground-truth depth
(RGB-D) or estimated depth (RGB).
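The depth-adaptive receptive field can be illustrated with basic pinhole-camera geometry: a neighborhood of fixed real-world size projects to fewer pixels the farther away it is. The function name, the neighborhood radius, and the focal length below are assumed values for illustration, not parameters from the paper:

```python
def pixel_radius(depth_m, world_radius_m=0.25, focal_px=500.0):
    """Pixel extent of a fixed-size 3D neighborhood under a pinhole camera:
    a ball of radius world_radius_m at depth depth_m projects to roughly
    focal_px * world_radius_m / depth_m pixels, so the sampling footprint
    grows for near points and shrinks for far ones."""
    return focal_px * world_radius_m / depth_m

near = pixel_radius(1.0)  # -> 125.0 pixels
far = pixel_radius(5.0)   # -> 25.0 pixels
```

Scaling each filter's sampling footprint this way is what makes the convolution local in 3D space rather than in the 2D image grid, yielding the scale invariance the abstract describes.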